Nature Biotechnology — Latest Matching Preprints

1

DAMPA - accelerated and simplified design of probe panels for targeted metagenomics using pangenome graphs

Payne, M.; Tam, K. K.-G.; Rockett, R. J.; Basile, K.; Bowden, R.; Sintchenko, V.; Kok, J.; Golubchik, T.

2026-05-22 infectious diseases 10.64898/2026.05.15.26352859 medRxiv

Top 0.1%

39.6%

Targeted metagenomics, where samples are enriched for multiple organisms of interest using oligonucleotide probes, is a highly efficient sequencing methodology that is becoming standard practice for genomics of viruses and complex polymicrobial samples. Efficient enrichment critically requires probes that capture both conserved and highly diverse genomic regions without loss of sensitivity, and with uniform representation in the sequencing pool. Design of optimal probesets poses a challenge: existing computational methods use k-mer hashing to reduce over-abundant sequences, but scalability and efficiency drop with increasing numbers of genomes, while diverse sequences remain under-represented. Here we show that incorporating evolutionary distance to compress probes via a graph-based representation of multiple genomes across species, together with k-mer hashing, reduces overrepresentation of conserved sequences, and yields more uniform coverage even of highly diverse loci. We make the method available in Dampa, an open-source tool that generates probesets in seconds on a standard laptop.
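The k-mer hashing baseline that the abstract contrasts against can be illustrated with a minimal sketch. This is not DAMPA's pangenome-graph method; it only shows the general idea of dropping candidate probes whose k-mers are already covered by earlier probes. The function names, probe length, tiling step, and the `max_shared` threshold are all hypothetical choices for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def select_probes(genomes, probe_len=8, k=4, step=4, max_shared=0.5):
    """Greedy redundancy filter (illustrative assumption, not DAMPA itself).

    Tile each genome into candidate probes and keep a probe only if at most
    `max_shared` of its k-mers are already covered by previously kept probes.
    """
    seen = set()    # k-mers covered by the probes kept so far
    probes = []
    for genome in genomes:
        for start in range(0, len(genome) - probe_len + 1, step):
            probe = genome[start:start + probe_len]
            probe_kmers = kmers(probe, k)
            shared = len(probe_kmers & seen) / len(probe_kmers)
            if shared <= max_shared:
                probes.append(probe)
                seen |= probe_kmers
    return probes


# Two identical genomes: the second contributes no additional probes.
print(select_probes(["ACGTACGTTTGA", "ACGTACGTTTGA"]))
```

A filter like this removes over-represented conserved sequence, but it has no notion of evolutionary distance, which is the gap the abstract says the graph-based representation addresses.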

2

SpaceBio Knowledge Hub: A LiteratOmics Platform for Microgravity and Space Biology Research

Silva, J. C. F.; Vieira, A.; Chue Donahey, M. S.; Silva, S. M. d. C.; Veloso, T.; Lopes, A.; Sexson, N.; Barker, R.; Porterfield, D. M.; Silva, C. A.; Dias, R.

2026-07-14 scientific communication and education 10.64898/2026.07.13.737239 medRxiv

Top 0.1%

31.4%

Space biology literature is growing exponentially. Existing infrastructure has not kept pace with organizing, synthesizing, and disseminating this knowledge. We present SpaceBio Knowledge Hub (www.spacebio.space), an integrated digital ecosystem that combines artificial intelligence, real-time data integration, and open-access infrastructure to advance research, education, and collaboration in microgravity, space biology and space exploration. The platform applies AI-driven approaches including natural language processing, machine learning, and automated content generation to construct a semantic atlas of the field. The atlas reveals the hierarchical thematic organization underlying microgravity-induced biological responses, space mission infrastructure, planetary science, and astrobiology. As part of this effort, SpaceBio is moving toward the construction of a LiteratOmics framework for microgravity and space biology: a systematic, AI-enabled approach to mining, integrating, and structuring the primary literature generated by omics-driven spaceflight research, treating the scientific literature itself as a navigable data layer alongside genomic, transcriptomic, and proteomic datasets. Built on a scalable, cloud-based architecture with a user-centered interface, SpaceBio supports literature exploration, data integration, and knowledge discovery for researchers, educators, students, industry partners, and citizen scientists. The platform also functions as a community-building ecosystem. It integrates hands-on research initiatives, AI-generated educational content, pilot data science projects, and social responsibility programs that broaden participation without compromising scientific rigor. AI-enabled digital environments can transform fragmented literature into a navigable knowledge landscape. SpaceBio accelerates research productivity, strengthens STEM education, and supports the global space life sciences community as human space exploration enters its most ambitious era.

3

CMS: Achieving Uniform and High-Quality Sequencing across Challenging Non-canonical Genomic Regions

Li, Q.; Liu, L.; Lin, Q.; Dan, X.; Jiang, Y.; Wei, Y.; Yang, M.; Peng, X.; Luo, W.; Wang, W.; Xu, D.; Huang, Z.; Sun, W.; Zhao, L.; Yan, Q.; Sun, L.; Feng, B.

2026-04-28 genomics 10.64898/2026.04.24.720553 medRxiv

Top 0.1%

26.5%

High-throughput sequencing is essential in modern biological research, yet low-complexity sequences remain challenging as they form structurally complex, non-canonical (non-B) DNA conformations that impede sequencing enzyme read-through. This leads to a long-standing trade-off: maximizing coverage introduces false positives (FP), while stringent filtering causes coverage loss and false negatives (FN). To address this, we developed CMS (Cross Mountains and Seas) on GeneMind sequencing platforms by optimizing its chemistry and enzymatic systems to traverse these secondary structures with high fidelity. Benchmarking across whole-genome (WGS) and whole-exome (WES) sequencing demonstrates that CMS addresses the trade-off by simultaneously enhancing both coverage uniformity and accuracy, notably achieving an approximately 100-fold reduction in low-coverage bins for WGS and a 70% reduction in FN insertions/deletions (INDELs) within complex non-B regions. Specifically, a synthetic G-quadruplex (G4) motif sequencing experiment demonstrates that CMS maintains a 1:1 strand ratio, effectively handling G4-induced biases where benchmarked platforms exhibit extensive depletion. These findings establish CMS as a reliable technology for the precise characterization of structurally challenging but functionally essential genome regions.

4

Spatially resolved, multimodal in vivo Perturb-seq using antibody-based cell hashing

Nevue, A. A.; Hartoularos, G. C.; De Valle, C.; Ramachandran, K.; Barron, J. J.; Lee, H.; Calleja Cervantes, M. E.; Bowness, J.; Velten, L.; Ricci-Tam, C.; Dobin, A.; Levy, M.; Ye, C. J.; Averbukh, I.; Lara-Astiaso, D.

2026-05-26 genomics 10.64898/2026.05.25.727765 medRxiv

Top 0.1%

22.3%

Large-scale perturbation screens have begun to map cell intrinsic gene-function relationships, yet how genes shape tissue architecture remains largely unexplored. To address this gap, we developed PerturbSpace, a novel approach that integrates CRISPR perturbations with spatially hashed single-cell multiomics. This approach enables the first high-throughput, spatially resolved Perturb-seq analysis across complex tissue architecture in vivo. Notably, PerturbSpace enables spatial transcriptome-wide perturbation readouts at organ scale and can be seamlessly integrated with orthogonal modalities. We combine PerturbSpace with surface proteomics and expressed lineage tracing barcodes to demonstrate multimodal compatibility. We use PerturbSpace to study the genetic determinants of tissue architecture. First, we map how 40 transcriptional regulators determine the size and lineage composition of colonies in the spleen during regenerative hematopoiesis. Second, we characterize immune-niche interactions in the liver by dissecting the extrinsic effects mediated by cytokine-secreting immune cells on their neighboring cells. Collectively, our work establishes PerturbSpace as a scalable and cost-effective approach for transcriptome-wide spatial profiling of whole cells while remaining compatible with the single-cell multiomics workflows that the field has already adopted at scale.

5

Whole-Proteome ESM-2 Embeddings Recover Taxonomy and Enable Geometry-Aware Triage of Foodborne Bacterial Genomes

Gutierrez, J.; Correa Alvarez, J.

2026-04-29 bioinformatics 10.64898/2026.04.26.720952 medRxiv

Top 0.1%

18.6%

Whole-genome sequencing (WGS) has transformed foodborne pathogen surveillance, yet time-sensitive decision-making remains constrained by computationally expensive alignment-centric workflows that scale poorly to outbreak volumes and lack built-in confidence signals. Using 21,657 GenomeTrakr-derived assemblies spanning nine food safety-relevant taxa, we represent each genome by mean-pooling per-protein embeddings from ESM-2 (480 dimensions). The resulting embedding space is dominated by taxonomic structure, exhibiting near-perfect neighborhood consistency for both species and a coarse species/pathotype-derived pathogenicity prior (mean homophily >0.99). Density-based clustering recovered species-coherent structure with high purity and bootstrap stability, while external agreement with the binary pathogenicity prior was only moderate, which is consistent with phylogenetic entanglement by design rather than embedding failure. As a within-genus stress test, kNN separates E. coli O157:H7 from non-pathogenic E. coli with ~98% accuracy (5-fold CV), demonstrating that known pathotype annotations are preserved in the embedding geometry even among closely related genomes. We position this mean-pooling baseline relative to contextual genome language models that retain protein order or operon-scale context, and outline how embedding geometry (homophily, purity, outliers) can serve as a principled confidence layer in bio-surveillance-oriented triage pipelines.
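The two operations named in the abstract, mean-pooling per-protein vectors into a genome vector and measuring neighborhood homophily, can be sketched in a few lines. This assumes one common definition of homophily (the average fraction of each point's k nearest neighbors that share its label); the abstract does not state the exact definition used, and the toy vectors below stand in for real ESM-2 embeddings.

```python
import math


def mean_pool(protein_embeddings):
    """Average per-protein vectors into one genome-level vector."""
    n = len(protein_embeddings)
    dim = len(protein_embeddings[0])
    return [sum(vec[d] for vec in protein_embeddings) / n for d in range(dim)]


def knn_homophily(genome_vectors, labels, k=1):
    """Mean fraction of each genome's k nearest neighbours sharing its label.

    Assumed definition for illustration; uses Euclidean distance.
    """
    scores = []
    for i, v in enumerate(genome_vectors):
        others = sorted(
            (math.dist(v, w), j)
            for j, w in enumerate(genome_vectors)
            if j != i
        )
        nearest = others[:k]
        same = sum(labels[j] == labels[i] for _, j in nearest)
        scores.append(same / len(nearest))
    return sum(scores) / len(scores)


# Toy example: two well-separated "species" in a 2-D embedding space.
vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels = ["a", "a", "b", "b"]
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]), knn_homophily(vectors, labels))
```

A per-genome score of this kind, rather than only the mean, is what would let embedding geometry act as a confidence signal for individual genomes.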

6

ARCHIVE: Machine-Guided Design of an Efficient Open-Ended DNA Recording Device to Increase Resolution of Multiplexed Cell History Tracking

Rosenstein, A. H.; Garton, M.

2026-07-13 synthetic biology 10.64898/2026.07.10.737758 medRxiv

Top 0.1%

18.4%

Engineering cell-based devices to record events into DNA has potential both as a non-ablative research tool and in the clinic for enacting gene-circuit-based logic of cell therapies conditional on cell history. Whether as a means of understanding interactions on the single-cell level, or reconstructing histories of cellular events, a cellular DNA recording device has widespread utility, with prime editing-based methods at the forefront of this endeavor - notably peCHYRON. Yet, the resolution of such open-ended recording tools is inherently constrained by edit insertion efficiency, and these tools cannot yet capture RNA-polymerase II-transcribed signals, which represent a large segment of functionally-defined endogenous gene-regulatory architectures. Here we present ARCHIVE (Amplified Recording of Cellular Histories into Information-dense Vectors of Events), capable of integrating RNA-encoded signals into a predefined genomic recording locus with high efficiency. By utilizing deep-learning assisted prediction of prime-editing efficiency as a surrogate fitness model for generative in silico pegRNA evolution, we developed a recording device with an order-of-magnitude improvement in temporal resolution (efficiency of iterative message integration steps) compared to the state of the art - a capability we establish here at the level of constitutive promoter tracking. We expect ARCHIVE to serve as a launching point for more advanced mammalian synthetic-biology recording devices for both functional genomics and therapeutics research.

7

TDKC (Target Distilled K-mer Classifier): Ultrafast and Memory-Efficient Sequence Classification for Target Pathogen Diagnostics

Lee, S.; Agarwal, V.; O'Brien, W.; Eskin, E.

2026-06-06 bioinformatics 10.64898/2026.06.05.730319 medRxiv

Top 0.1%

18.2%

Metagenomic sequencing can identify pathogens from clinical samples without prior knowledge of the causative agent. Yet, as sequencing workflows scale to process thousands of multiplexed samples simultaneously, classifying these samples against massive reference databases creates a significant computational bottleneck. Furthermore, large-scale applications such as screening public sequence repositories remain computationally challenging. Existing metagenomic classifiers are designed for full-taxon classification, where the goal is to identify all organisms in a sample. However, many diagnostic applications focus on detecting a specific set of clinically relevant pathogens. This constraint can be exploited to significantly lower computational costs. Here we present TDKC (Target Distilled K-mer Classifier), a method for targeted metagenomic classification. TDKC constructs a compact index by distilling target-specific k-mers from a full-taxon reference database. When classifying clinical samples, TDKC uses 16.9-33.6x less memory and is 5.2-34.3x faster than per-read full-taxon and targeted classifiers (Kraken2, Centrifuger, CLARK), while maintaining high sensitivity and low false positive rates. Against the sketch-based profiler Sylph, TDKC remains 4.2x faster and uses 8.5x less memory. TDKC also supports per-k-mer accession tracking across over 3 million source accessions for downstream subtype analysis, and domain-level detection of bacteria, archaea, and viruses. By reducing the index to only the pathogens of interest, TDKC makes targeted pathogen detection feasible at scale.
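The idea of distilling target-specific k-mers can be illustrated with a small sketch. It assumes the simplest possible reading of the abstract: keep only k-mers that occur in a target genome and not in any background sequence, then classify a read by counting index hits. The function names, the `min_hits` threshold, and the omission of canonical (reverse-complement) k-mers are simplifications for illustration, not TDKC's actual design.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-mers in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_target_index(targets, background, k):
    """Map each target-specific k-mer to the target(s) that contain it."""
    background_kmers = set()
    for seq in background:
        background_kmers |= kmers(seq, k)
    index = {}
    for name, seq in targets.items():
        # Keep only k-mers never seen in the background sequences.
        for kmer in kmers(seq, k) - background_kmers:
            index.setdefault(kmer, set()).add(name)
    return index


def classify_read(read, index, k, min_hits=2):
    """Return the target with the most k-mer hits, or None below min_hits."""
    hits = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            hits[name] += 1
    if not hits:
        return None
    name, count = hits.most_common(1)[0]
    return name if count >= min_hits else None


index = build_target_index({"pathogenA": "AAAACCCCGGGG"}, ["TTTTTTTTAAAA"], k=4)
print(classify_read("ACCCCGG", index, k=4), classify_read("TTTTAAAA", index, k=4))
```

Because the index holds only k-mers for the pathogens of interest, its size scales with the target set rather than with the full reference database, which is the source of the memory savings the abstract describes.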

8

MetaUmbra: Statistically Controlled Genome-Level Presence Inference from Metaproteomic Peptides

Wu, Q.; Ning, Z.; Zhang, A.; Cheng, K.; Figeys, D.

2026-05-04 bioinformatics 10.64898/2026.04.29.721689 medRxiv

Top 0.1%

18.2%

Taxonomic interpretation of metaproteomic peptides remains difficult because many peptide sequences are present in proteins from different organisms, reducing taxonomic specificity. Current peptide-centric workflows can report taxonomic summaries or taxon-level confidence scores, but they do not provide formal statistical evidence that a taxon is present in the microbiome. Here we present MetaUmbra, a tool that derives genome-level statistical significance values from identified peptides. MetaUmbra builds theoretical peptide lists by in silico digestion of the taxon-specific proteins and matches observed peptides against these references. It then combines a conservative significance estimate from unique peptides with a Monte Carlo-based p-value for shared peptide evidence estimated under an empirical null model. In the defined-community benchmark SIHUMIx, MetaUmbra identified the expected genomes without introducing false-positive genomes after embedding the SIHUMIx genomes in a large gut reference background. In the single-strain benchmark Mix24X, all expected genomes were identified with the strongest statistical significance even after near-neighbor and full background expansion. In a hamster gut genome panel, MetaUmbra further preserved an interpretable ranking of candidate genomes in a dense real-data setting. Together, these results show that MetaUmbra can statistically identify the presence of specific microbes in a complex microbiome while maintaining low false-positive calls. MetaUmbra therefore provides a practical framework for converting peptide evidence into genome-level statistical inference in metaproteomics.
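A Monte Carlo p-value of the kind mentioned above can be sketched generically. The abstract does not specify MetaUmbra's null model, so the null sampler below (each of 20 shared peptides matching a genome by chance with probability 0.1) is a purely hypothetical stand-in; only the estimator itself, with the standard add-one correction that keeps the estimate above zero, is the point of the example.

```python
import random


def monte_carlo_p_value(observed, null_sampler, n_sim=10000, seed=0):
    """Estimate P(null statistic >= observed) by simulation.

    Uses the (exceed + 1) / (n_sim + 1) estimator so the p-value is never 0.
    """
    rng = random.Random(seed)
    exceed = sum(null_sampler(rng) >= observed for _ in range(n_sim))
    return (exceed + 1) / (n_sim + 1)


def chance_matches(rng, n_peptides=20, p_match=0.1):
    """Hypothetical null: count of shared peptides matching by chance."""
    return sum(rng.random() < p_match for _ in range(n_peptides))


# How surprising would 8 matching shared peptides be under this toy null?
print(monte_carlo_p_value(8, chance_matches))
```

With this estimator the smallest reportable p-value is 1 / (n_sim + 1), so the number of simulations bounds the achievable significance.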

9

Stitch-seq: Scalable CRISPR gene expression response profiling

Keer, F. R.; AlKhafaji, A. M.; Blainey, P. C.

2026-04-30 bioengineering 10.64898/2026.04.27.719216 medRxiv

Top 0.1%

18.1%

Single-cell profiling of genetic perturbations has expanded our ability to map causal links between genes and phenotypes; however, the high cost and technical complexity of current methods restrict systematic interrogation of dynamic cellular programs. Here, we present Stitch-seq, a high-throughput pooled functional genomics sequencing method enabling simultaneous capture of CRISPR perturbations and targeted gene and protein expression across millions of cells. Stitch-seq utilizes single-cell droplet-based overlap-extension reverse-transcription PCR reactions to physically link gene expression features of interest to perturbation identifiers without cell barcoding or extensive sequencing. We validated Stitch-seq's high fidelity using simplified models, benchmarked multi-omic Stitch-seq against single-cell RNA-sequencing in the MCF10A Epithelial-Mesenchymal Transition (EMT) model, and applied Stitch-seq to map transcriptional responses of MCF10A cells undergoing TGF-β-induced EMT to perturbations across five time points. By efficiently delivering large-scale multi-omic gene expression readouts, Stitch-seq provides a powerful and accessible modality for the routine dissection of complex biological pathways.

10

Cohort-HMM marker recruitment with per-OG orthology QC for phylogenomic supermatrices

Nielsen, T. N.

2026-05-31 bioinformatics 10.64898/2026.05.27.728348 medRxiv

Top 0.1%

18.0%

OrthoFinder's all-vs-all DIAMOND step systematically misses single-copy orthogroups (SC OGs) at deep taxonomic divergence: a marker recovered cleanly within a tightly defined cohort is dropped when the same marker is searched against phylum-broad metagenome-assembled genome (MAG) sets, because pairwise sequence similarity falls below DIAMOND's detection threshold even when the underlying ortholog is present. The result is biased dropout -- supermatrices that retain genomes near the cohort but lose genomes from the deeper, more diverged corners of the same phylum. We describe a two-stage cohort-HMM recruitment pipeline (per-OG profile HMMs built from cohort alignments, then hmmsearch against the broader proteome set) followed by an independent per-OG gene-tree QC step that classifies each recruited hit relative to the cohort's most recent common ancestor (MRCA) descendant set, with a per-MAG paralog-rate filter applied before supermatrix concatenation. We characterize the pipeline across three taxonomic ranks. At phylum scale (Omnitrophota, 97 cohort OGs, 714 NCBI MAGs), the recruitment recovers MAGs that the OrthoFinder-only supermatrix would otherwise drop, and the QC identifies 2 deep-peripheral MAGs -- divergent genomes whose per-OG tips repeatedly place outside the cohort MRCA descendant set despite being orthologs -- that the per-MAG filter removes. At family scale (Pelagibacteraceae, 146 cohort OGs, 366 NCBI MAGs) and at genus scale (Actinomarina, 289 cohort OGs, 23 NCBI MAGs), the per-tip paralog-candidate rate drops to 0.0%. The pipeline addresses two independent failure modes. Cohort paralog density breaks strict-SC OG discovery at the cohort step (the family-rank case, where every candidate marker has at least one cohort species carrying multiple copies; the relaxed cohort criterion supplies the marker set and HMM recruitment disambiguates which copy each NCBI MAG contributes). DIAMOND-reach attrition breaks OG assignment for the most divergent NCBI MAGs (the phylum-rank case, where pairwise similarities fall below DIAMOND's detection threshold; HMM recruitment recovers the dropouts and the per-OG QC step filters residual paralog candidates). At genus rank both modes are inactive and OrthoFinder suffices directly; HMM recruitment runs but finds no new orthologs. Code and per-case data products are released as a community resource at Zenodo (DOI 10.5281/zenodo.20422348).

11

Cross-linked volumetric DNA microscopy for dense molecular-network phenotyping in intact tissue

Qian, N.; Yasser, R.; Yu, M.; Chang, H.; Weinstein, J. A.

2026-06-04 bioengineering 10.64898/2026.06.01.729154 medRxiv

Top 0.1%

17.5%

Resolving cellular phenotypes in full tissue context requires methods that can retain those cells' physical neighborhoods, together with the identities of individual biomolecules, in intact three-dimensional specimens. We introduce cross-linked volumetric DNA microscopy (xVDM), in which unique molecular identifiers are seeded directly into the tissue's protein matrix and linked by uniquely labeled DNA bridges to create a dense, DNA-encoded proximity network. Cell-scale molecular communities are then reconstructed directly from this network. xVDM produces denser molecular networks and broader transcriptome recovery than when these networks are nucleated by transcripts alone. xVDM reconstructs genetically annotated three-dimensional networks that map onto cell states and tissue regions in intact zebrafish embryos at 12, 18, and 24 hpf. Antibody-oligonucleotide conjugates extend the same framework to protein targets in human tonsil. xVDM provides a route to three-dimensional molecular phenotyping in intact specimens using only standard bench reagents and a DNA sequencer.

12

NinjaSeq: programmable restriction enzyme-based sequencing library preparation with random access for DNA data storage

Galminas, I.; Sabary, O.; Abraham, H.; Kaminskaite, K.; Cohen, T.; Gruodyte, V.; Alzbutas, G.; Yakhini, Z.; Palepsiene, R.; Zemaitis, L.; Yaakobi, E.; Juzenas, S.

2026-07-08 molecular biology 10.64898/2026.06.11.730843 medRxiv

Top 0.1%

16.2%

DNA data storage allows sequences to be defined without biological constraints, yet readout workflows still depend on generic end-repair/dA-tailing chemistry. We developed NinjaSeq, a type IIS restriction endonuclease library-preparation strategy that incorporates recognition sites into primer flanks, enabling digestion to generate adapter-compatible overhangs and eliminating the need for conventional end preparation. By combining this chemistry with constrained coding that excludes internal recognition motifs, NinjaSeq produced sequencing quality and decoding performance consistent with standard protocols while reducing reagent burden and simplifying processing, including compatibility with one-pot restriction-ligation. The same sequence-directed design also enables physical random access during library preparation: targeting file-specific flanking sites enriched a desired file from a mixed pool by about sixteen-fold in a proof-of-concept experiment. These results position NinjaSeq as a practical ONT readout approach for DNA data storage. Highlights: NinjaSeq replaces end-repair/dA-tailing with REases for nanopore sequencing; constrained encoding excludes recognition motifs to protect payloads from cleavage; NinjaSeq achieves decoding accuracy comparable to standard library preparation; designing file-specific RRS enables random access during library preparation.

13

The human gut virome is a non-redundant and clinically informative component of the microbiome

Yang, Y.; Huang, D.; Korzenik, J. R.; Weiss, S. T.; Liu, Y.-Y.; Sun, Z.

2026-05-15 bioinformatics 10.64898/2026.05.13.724676 medRxiv

Top 0.1%

15.5%

The gut virome represents a vast reservoir of genetic diversity with profound implications for human health, yet it remains the "dark matter" of the microbiome due to the staggering complexity of reproducible viral profiling. It remains fundamentally contested whether biologically informative virome signals can be robustly recovered from routine whole-metagenome sequencing (WMS), and to what extent these signals offer ecological insights independent of the bacteriome. Here we present VIP2B, a framework that leverages Type IIB restriction tags to extract multifaceted viral features (encompassing taxonomy, coverage, function, and phenotype) directly from bulk WMS data. Through extensive benchmarking across incomplete references, unseen genomes, and high bacterial or host background, we demonstrate that VIP2B achieved high precision and robust taxonomic concordance. By applying VIP2B to paired bulk and virus-like particle (VLP)-enriched datasets, we reveal a species-level overlap far greater than previously recognized, proving that standard bulk metagenomes contain a wealth of recoverable viral information. Analysis of 20 clinical cohorts demonstrates that coverage-, function-, and phenotype-resolved viral features consistently identify disease-associated signatures that escape taxonomic analysis alone, significantly improving diagnostic models over bacteriome-only approaches. Finally, we define two distinct gut virome community states at the population scale (n=6,090), characterized by divergent diversity profiles and health associations. Our findings establish the gut virome as a non-redundant, clinically actionable component of the human holobiont and provide the methodology necessary to transition microbiome research toward a truly multi-kingdom framework.

14

Auditable recovery of single-cell RNA-seq zeros with SPARE

Hernandez Galaz, S. F.; Pezoa, I.; Hernandez-Oliveras, A.; Lladser, A.; Martin, A. J.

2026-06-05 bioinformatics 10.64898/2026.06.02.729664 medRxiv

Top 0.1%

15.4%

In single-cell RNA-seq, observed zeros are a mix of biological absence and technical limitations. However, current evaluation metrics fail to distinguish between these two states, focusing on reconstruction accuracy rather than the biological validity of edits. We introduce SPARE, a partition-aware framework that audits the imputation process by cataloging observed zeros as edited, unchanged, or marker-vetoed prior to sequence reconstruction. Across multiple tissue benchmarks, SPARE successfully recovered masked expression while cataloging edit burden, marker leakage and raw-state movement. Protein and disease-context audits showed that recovered expression must be accepted, restricted or rejected endpoint by endpoint. SPARE reframes imputation as auditable zero editing rather than generic matrix completion.

15

NanoLabel: A fast and accurate real-time nanopore signal classifier

Mahajan, D.; Jain, C.; Kashyap, N.

2026-05-06 genomics 10.64898/2026.05.03.722500 medRxiv

Top 0.1%

15.4%

Oxford Nanopore Technologies' adaptive sampling capability promises to reduce sequencing cost and turnaround time. At its core, adaptive sampling is a real-time classification problem that distinguishes reads originating from regions of interest. Direct signal-based classification approaches bypass the computational bottleneck of basecalling and can eliminate the need for powerful GPUs. However, operating directly on noisy raw signals remains challenging in real-time settings, where classification decisions must be made quickly. In this work, we propose NanoLabel, a new method for real-time classification of nanopore signals. We build NanoLabel on top of the signal-based read-mapping tool RawHash2. We accelerate the classification workflow by mapping reads using only the target regions as the reference. To further improve accuracy, we train a lightweight classifier on mapping-derived features and introduce a data augmentation strategy to construct sufficiently large and class-balanced training datasets. We evaluate NanoLabel using publicly available real sequencing datasets from three human genomes (HG001, HG002, and HG005), while assuming a cancer gene panel as the target. Compared to directly mapping reads with RawHash2, we demonstrate an 80x improvement in classification time and an improvement of 0.10-0.25 in F1 score.

16

Long-read single-cell genomics: resolving chimeras in multiple displacement amplification

McGowan, J.; Lipscombe, J.; Kilias, E. S.; Barker, T.; Catchpole, L.; Durrant, A.; Irish, N.; McTaggart, S.; Warring, S. D.; Gharbi, K.; Richards, T. A.; Hall, N.; Swarbreck, D.

2026-06-15 genomics 10.64898/2026.06.11.730069 medRxiv

Top 0.1%

15.3%

Multiple displacement amplification (MDA) enables whole-genome amplification from single cells, but introduces chimeric artifacts that severely compromise downstream analyses, particularly with long-read sequencing. Here, we systematically evaluate long-read PacBio HiFi sequencing of MDA amplified DNA from single cells using the model green alga Chlamydomonas reinhardtii. We show that MDA-derived libraries exhibit highly uneven coverage and extreme chimera rates impacting up to 70% of reads, leading to thousands of artefactual structural variants and misassemblies when assembled using algorithms designed for bulk sequencing. To overcome these challenges, we developed lrSAGA (long-read Single Amplified Genome Assembly), a novel tool to assemble long-read MDA sequencing datasets. Assemblies generated using lrSAGA are more complete, more contiguous, and have 75-95% fewer misassemblies compared to conventional assembly algorithms. Although overall contiguity is limited by MDA coverage dropouts, we demonstrate that up to 68% of the C. reinhardtii genome can be accurately assembled from just a single haploid cell. We further validated lrSAGA using published Oxford Nanopore and PacBio HiFi data from single or half Caenorhabditis elegans worms, generating accurate and highly complete assemblies. Applying our approach to single protist cells isolated from environmental water samples, we performed PacBio HiFi single-cell genome sequencing of four uncultivated microbial eukaryotes: an amoeboflagellate from the Naegleria genus, a flagellate from the Bodo genus, and two deep-branching flagellates from the enigmatic CRuMs supergroup, Collodictyon triciliatum and Diphylleia rotans. From single cells, we generated high-quality draft genome assemblies estimated to be 70-84% complete, demonstrating the potential of long-read single-cell genomics to unlock genome diversity from uncultivated microbial eukaryotes.

17

DynamicDemiLog: A Single Sketch for Ultrafast Similarity, Frequency, and Cardinality Estimation

Bushnell, B. J.

2026-06-16 bioinformatics 10.64898/2026.06.12.731986 medRxiv

Top 0.1%

15.2%

Probabilistic cardinality estimators (HyperLogLog), similarity sketches (MinHash), and frequency estimators (Count-Min Sketch) are fundamental approximate data structures that each target one primary problem. We present DynamicDemiLog (DDL), a sketch that unifies cardinality estimation, set similarity, containment, element frequency and composition in one tiny data structure built from a single pass over the input stream. Using an inverted index over 200,687 RefSeq sketches (159,567 organisms), DDL performs all-to-all sketch similarity comparison of the full database in 30 seconds (128 threads, indexed) -- over 375x faster per query than Mash's brute-force all-to-all comparison of 91,282 sketches, or 31x faster without the index, at double the sketch resolution. DDL extends the LogLog register with a mantissa: each register stores a floating-point-encoded hash value consisting of an integer exponent (the leading-zero count) and a fractional mantissa (the sub-leading-zero bits), rather than the integer leading-zero count alone. This preserves enough hash information for meaningful register-by-register comparison -- a property that standard 6-bit registers lack -- while improving on LogLog's cardinality estimation machinery, including DynamicLogLog's early exit mask for high-throughput streaming. With a default 10 mantissa bits (16-bit registers, 2,048 buckets, 4 KB), DDL achieves a per-register false-match rate of 0.018% on unrelated random same-size sets (compared to 17.0% for LL6, a basic HyperLogLog implementation), enabling Weighted Kmer Identity (WKID), Average Nucleotide Identity (ANI), containment, and completeness estimation from register comparison alone. A 16-bit per-register observation counter provides element frequency information at trivial additional computation cost, and an additional byte tracks element composition (GC content, for biological data). Furthermore, DDL's high-specificity registers enable an inverted index structure (DDLIndex) that answers similarity queries against a database of N sketches in O(B + M) time, where M is the number of matching index entries, compared to O(N x B) for pairwise comparison. DDL achieves a 930x reduction in false register matches compared to LL6 (Section 11.1), accurately estimates ANI between full and partial genomes down to approximately 79% identity (at k=25, B=2,048), and maintains near-zero spurious similarity on unrelated inputs -- all at similar construction speed to LL6, and 3.5x faster than SetSketch.
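The register layout described in the abstract (a leading-zero exponent plus a 10-bit mantissa in a 16-bit register, 2,048 buckets) can be illustrated with a rough sketch. Everything beyond those three numbers is an assumption made for illustration: the hash function, which hash bits select the bucket, the rule of keeping the smallest hash per bucket, and the matching-fraction statistic. It is not DDL's implementation, and for simplicity it keeps the full minimum hash per bucket and encodes it at the end rather than updating packed registers in a stream.

```python
import hashlib

BUCKET_BITS = 11                 # 2,048 buckets
BUCKETS = 1 << BUCKET_BITS
MANTISSA_BITS = 10
HASH_BITS = 64 - BUCKET_BITS     # bits left after removing the bucket index


def hash64(item):
    """64-bit hash of a string (hash choice is an assumption)."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def encode(rest):
    """Split a HASH_BITS-wide value into (leading-zero count, mantissa)."""
    if rest == 0:
        return HASH_BITS, 0
    top = rest.bit_length() - 1          # position of the leading one bit
    exponent = HASH_BITS - 1 - top       # number of leading zeros
    below = rest ^ (1 << top)            # bits beneath the leading one
    if top >= MANTISSA_BITS:
        mantissa = below >> (top - MANTISSA_BITS)
    else:
        mantissa = below << (MANTISSA_BITS - top)
    return exponent, mantissa


def build_sketch(items):
    """Keep the smallest hash per bucket, then pack it into 16 bits."""
    smallest = {}
    for item in items:
        h = hash64(item)
        bucket, rest = h & (BUCKETS - 1), h >> BUCKET_BITS
        if bucket not in smallest or rest < smallest[bucket]:
            smallest[bucket] = rest
    sketch = {}
    for bucket, rest in smallest.items():
        exponent, mantissa = encode(rest)
        sketch[bucket] = (exponent << MANTISSA_BITS) | mantissa
    return sketch


def register_match_fraction(a, b):
    """Fraction of buckets filled in both sketches whose registers agree."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return sum(a[k] == b[k] for k in shared) / len(shared)
```

The mantissa is what makes register equality informative: two unrelated sets often share a leading-zero count by chance, but rarely share the next 10 bits as well, which is the intuition behind the lower false-match rate the abstract reports.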

18

Scalable genotyping in fixed transcriptomes resolves clonal heterogeneity via single-cell sequencing

Blattman, S. B.; Maslah, N.; Varela, A. A.; Kumpaitis, K.; Nalbant, B.; Snopkowski, C.; Mariani, M.; Kida, L. C.; Takizawa, M.; Ratnayeke, N.; Yu, K. K. H.; Fernandes, S.; Mousavi, N.; Borgstrom, E.; Vallejo, D.; Boghospor, L.; Xin, R.; Mignardi, M.; Wu, S.; Scarlott, N.; Delgado-Rivera, L.; Kumar, P.; Krishnan, S.; Giraudier, S.; Kiladjian, J.-J.; Howitt, B. E.; Kohlway, A.; Lund, P.; Pe'er, D.; Chaligne, R.; Lareau, C. A.

2026-05-10 genomics 10.64898/2026.04.11.717967 medRxiv

Top 0.1%

15.1%

Despite the promise of single-cell transcriptomics for understanding cell states in heterogeneous populations, widely used platforms have limited ability to link transcriptional states to somatic mutations within the same cells. Here, we introduce Genotyping in Fixed Transcriptomes (GIFT) for the simultaneous detection of large numbers of targeted genetic variants with whole transcriptome profiles in single cells. The core innovation of GIFT is a rationally designed gap-filling reaction between adjacent single-stranded DNA (ssDNA) probes that barcodes native transcript sequence to enable highly specific targeted mutation detection. GIFT achieves greater than 99% genotyping accuracy and flexible capture of hundreds of mutations per cell, including in formalin-fixed, paraffin-embedded (FFPE) tissue, enabling clonal lineage tracing in heterogeneous settings. We demonstrate the unique scalability of GIFT by profiling more than 700,000 cells from 35 donors with myeloproliferative neoplasms (MPN), revealing mutation-dependent hematopoietic responses to systemic inflammation associated with the characteristic JAK2 V617F mutation, including an allelic dose gradient of interferon-associated transcriptional programs and priming of hematopoietic stem cells that develop into divergent disease states. The technical advantages of GIFT enable direct resolution of genotype-to-phenotype relationships via clonal tracing with comprehensive cell-state measurements at single-cell resolution.

19

Massively multiplex multimodal chemical screens at single-cell resolution

Chen, K. Y.; Lopez, R.; Eraslan, B.; Hata, M.; Takeshima, Y.; Makino, K.; Kibayashi, T.; Ichiyama, K.; Biton, A.; Huetter, J.-C.; Kundaje, A.; Pritchard, J. K.; Regev, A.; Sakaguchi, S.

2026-05-27 genomics 10.64898/2026.05.23.727396 medRxiv

Top 0.1%

15.1%

Recent applications of scRNA-seq for massively multiplexed chemical screens have enabled comprehensive profiling of drug responses at unprecedented scale and resolution. However, current assays remain limited to RNA readouts, lacking information on other phenotypic and mechanistic layers such as chromatin accessibility, protein abundance and post-translational modifications. Here, we introduce a scalable framework for multimodal chemical screens, combining parallel small-molecule perturbations with multimodal readouts. We extend existing experimental platforms into icCITE-plex and DOGMA-plex, enabling joint profiling of RNA, protein, and epigenomic responses to thousands of chemical perturbations in parallel. To systematically decode the regulatory circuitry underlying these responses, we develop MoCAVI, a contrastive analysis framework that disentangles the effect of small molecules from control variation in multimodal measurements, and PERCISTRA, a pipeline that infers causal links between chromatin accessibility and gene expression. Applied across ~410,000 primary T cells under ~2,800 conditions, our approach resolves compound-specific mechanisms, highlights off-target effects, and links chromatin accessibility changes to transcription factor networks in primary T cells. Our results establish a generalizable platform for profiling and analyzing cellular responses to chemical perturbations across multiple modalities.

20

Biological foundation models illuminate annotation blind spots in evolutionarily divergent genomes

Lanser, T. B.; Caldwell, S. K.; Pacheco, G. A.; Chen, J. W.; Saghaei, S.; Hassan, M.; Kronrod, M.; Wesemann, D. R.; Frost, H. R.

2026-05-16 bioinformatics 10.64898/2026.05.15.724572 medRxiv

Top 0.1%

15.0%

Chromosome-scale assemblies are increasingly available for non-model organisms, but functional annotation remains limited when deep evolutionary divergence erodes primary amino-acid sequence identity even though protein structural similarity can remain conserved. We present a hybrid annotation framework that decouples gene-model discovery from cross-species similarity assignment by combining Evo2-based ab initio prediction of exon-intron structures with ESM-2 protein-embedding-based structural similarity mapping. Applied to the sea lamprey, the framework derives high- or medium-confidence cross-species similarity assignments for 73,485 Evo2-derived translated protein models, including 35,395 high-confidence calls, and expands the deduplicated structural catalog to 31,286 loci, including 20,871 additions absent from the Ensembl baseline. A joint alignment-structure classification identifies 21,391 structurally supported catalog loci that a fixed human DIAMOND protein search does not confidently assign on its own, including 21,184 loci with no detectable human protein-sequence match and 207 loci with only low-confidence matches in the classical 20-30% amino-acid-identity twilight zone. These rescue-space totals describe catalog loci rather than validated one-to-one human-absent genes. In a single-cell RNA sequencing application, a stricter UTR-aware Ensembl+Evo2 reference improves gene recovery and expands the interpretable feature space of the lamprey immune compartment relative to the Ensembl baseline. This enables more resolved annotation of four transcriptionally defined immune cell states, including VLRA+-associated T-like and VLRB+-associated B-like programs together with oxidative iron-handling and iron-associated VLR-linked states. 
Together, these results show that structural protein signal often persists beyond the limits of pairwise sequence alignment and that an embedding-based annotation layer can extend that signal to improve downstream comparative and single-cell analyses in evolutionarily divergent genomes.